How Does Pruning Impact Long-Tailed Multi-Label Medical Image Classifiers?

Pruning has emerged as a powerful technique for compressing deep neural networks, reducing memory usage and inference time without significantly affecting overall performance. However, the nuanced ways in which pruning impacts model behavior are not well understood, particularly for long-tailed, multi-label datasets commonly found in clinical settings. This knowledge gap could have dangerous implications when deploying a pruned model for diagnosis, where unexpected model behavior could impact patient well-being. To fill this gap, we perform the first analysis of pruning’s effect on neural networks trained to diagnose thorax diseases from chest X-rays (CXRs). On two large CXR datasets, we examine which diseases are most affected by pruning and characterize class “forgettability” based on disease frequency and co-occurrence behavior. Further, we identify individual CXRs where uncompressed and heavily pruned models disagree, known as pruning-identified exemplars (PIEs), and conduct a human reader study to evaluate their unifying qualities. We find that radiologists perceive PIEs as having more label noise, lower image quality, and higher diagnosis difficulty. This work represents a first step toward understanding the impact of pruning on model behavior in deep long-tailed, multi-label medical image classification. All code, model weights, and data access instructions can be found at https://github.com/VITA-Group/PruneCXR.


Introduction
Deep learning has enabled significant progress in image-based computer-aided diagnosis [26,10,33,23,8]. However, the increasing memory requirements of deep neural networks limit their practical deployment in hardware-constrained environments. One promising approach to reducing memory usage and inference latency is model pruning, which aims to remove redundant or unimportant model weights [21]. Since modern deep neural networks are often overparameterized, they can be heavily pruned with minimal impact on overall performance [34,20,22,6]. That said, the impact of pruning on model behavior beyond high-level performance metrics like top-1 accuracy remains unclear. This gap in understanding has major implications for real-world deployment of neural networks for high-risk tasks like disease diagnosis, where pruning may cause unexpected consequences that could threaten patient well-being.
To bridge this gap, this study aims to answer the following guiding questions by conducting experiments to dissect the differential impact of pruning:
Q1. What is the impact of pruning on overall performance in long-tailed multi-label medical image classification?
Q2. Which disease classes are most affected by pruning and why?
Q3. How does disease co-occurrence influence the impact of pruning?
Q4. Which individual images are most vulnerable to pruning?
We focus our experiments on thorax disease classification on chest X-rays (CXRs), a challenging long-tailed and multi-label computer-aided diagnosis problem, where patients may present with multiple abnormal findings in one exam and most findings are rare relative to the few most common diseases [12].
This study draws inspiration from Hooker et al. [13], who found that pruning disparately impacts a small subset of classes in order to maintain overall performance. The authors also introduced pruning-identified exemplars (PIEs), images where an uncompressed and heavily pruned model disagree. They discovered that PIEs share common characteristics such as multiple salient objects and noisy, fine-grained labels. While these findings uncover what neural networks "forget" upon pruning, the insights are limited to highly curated natural image datasets where each image belongs to one class. Previous studies have shown that pruning can enhance fairness [31], robustness [1], and efficiency for medical image classification [32,4,7] and segmentation [25,24,29,16,3] tasks. However, these efforts also either focused solely on high-level performance or did not consider settings with severe class imbalance or co-occurrence.
Unlike existing work, we explicitly connect class "forgettability" to the unique aspects of our problem setting: disease frequency (long-tailedness) and disease co-occurrence (multi-label behavior). Since many diagnostic exams, like CXR, are long-tailed and multi-label, this work fills a critical knowledge gap, enabling more informed deployment of pruned disease classifiers. We hope that our findings can provide a foundation for future research on pruning in clinically realistic settings.

[...] labels for each image by adding five new rare disease findings parsed from radiology reports. This creates a challenging long-tailed classification problem, with training class prevalence ranging from under 100 to over 70,000 (Supplement). NIH-CXR-LT contains 112,120 CXRs, each labeled with at least one of 20 classes, while MIMIC-CXR-LT contains 257,018 frontal CXRs labeled with at least one of 19 classes. Each dataset was split into training (70%), validation (10%), and test (20%) sets at the patient level.
Model Pruning & Evaluation. Following Hooker et al. [13], we focus on global unstructured L1 pruning [34]. After training a disease classifier, a fraction k of weights with the smallest magnitude are "pruned" (set to zero); for instance, k = 0.9 means 90% of weights have been pruned. While area under the receiver operating characteristic curve is a standard metric on related datasets [26,28,30], it can become heavily inflated in the presence of class imbalance [5,2]. Since we seek a metric that is both resistant to imbalance and captures performance across thresholds (as choosing a threshold is non-trivial in the multi-label setting [27]), we use average precision (AP) as our primary metric.
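As an illustration of the pruning rule described above, the following is a minimal sketch of global unstructured magnitude pruning in plain Python. The `layers` dictionary, the layer names, and the toy weights are hypothetical; a real model would hold tensors rather than lists, and rounding the pruned count to the nearest integer is an assumption of this sketch.

```python
def global_magnitude_prune(layers, k):
    """Zero out the fraction k of weights with the smallest magnitude.

    `layers` is a hypothetical mapping of layer name -> flat list of weights.
    Magnitudes are pooled across all layers, so the cutoff is global rather
    than per-layer. The input is left unmodified; a pruned copy is returned.
    """
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie in [0, 1]")
    # Collect (magnitude, layer name, index) for every weight in the model.
    entries = [
        (abs(w), name, idx)
        for name, weights in layers.items()
        for idx, w in enumerate(weights)
    ]
    # Assumption: round the number of pruned weights to the nearest integer.
    n_prune = int(round(k * len(entries)))
    entries.sort(key=lambda entry: entry[0])
    pruned = {name: list(weights) for name, weights in layers.items()}
    for _, name, idx in entries[:n_prune]:
        pruned[name][idx] = 0.0
    return pruned


# Toy example: 10 weights in total, so k = 0.5 removes the 5 smallest.
layers = {
    "conv1": [0.5, -0.01, 0.2, 0.003],
    "fc": [-0.9, 0.02, 0.0005, 0.1, -0.04, 0.3],
}
pruned = global_magnitude_prune(layers, k=0.5)
print(pruned)
```

Because the threshold is shared across layers, a layer whose weights are uniformly small can lose a much larger share of its parameters than other layers. PyTorch exposes comparable functionality through `torch.nn.utils.prune.global_unstructured` with `prune.L1Unstructured`; whether the released code relies on it is not stated in this text.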

Assessing the Impact of Pruning
Experimental Setup. We first train a baseline model to classify thorax diseases on both NIH-CXR-LT and MIMIC-CXR-LT. The architecture used was a ResNet50 [9] with ImageNet-pretrained weights and a sigmoid cross-entropy loss. For full training details, please see the Supplemental Materials and code repository. Following Hooker et al. [13], we then repeat this process with 30 unique random initializations, performing L1 pruning at a range of sparsity ratios k ∈ {0, 0.05, ..., 0.9, 0.95} on each model and dataset. Using a "population" of 30 models allows for reliable estimation of model performance at each sparsity ratio. We then analyze how pruning impacts overall, disease-level, and image-level model behavior with increasing sparsity as described below.
Overall & Class-Level Analysis. To evaluate the overall impact of pruning, we compute the mean AP across classes for each sparsity ratio and dataset. We use Welch's t-test to assess performance differences between the 30 uncompressed models and 30 k-sparse models. We then characterize the class-level impact of pruning by considering the relative change in AP from an uncompressed model to its k-sparse counterpart for all k. Using relative change in AP allows for comparison of the impact of pruning regardless of class difficulty. We then define the forgettability curve of a class c as follows:

$$\left\{ \operatorname{med}\left\{ \frac{AP_{i,k,c} - AP_{i,0,c}}{AP_{i,0,c}} \right\}_{i \in \{1,\dots,30\}} \right\}_{k \in \{0,\,0.05,\,\dots,\,0.9,\,0.95\}}$$

where $AP_{i,k,c}$ := AP of the $i$th model with sparsity $k$ on class $c$, and med(·) := median across all 30 runs. We analyze how these curves relate to class frequency and co-occurrence using Pearson (r) and Spearman (ρ) correlation tests.

Incorporating Disease Co-occurrence Behavior. For each unique pair of NIH-CXR-LT classes, we compute the Forgettability Curve Dissimilarity (FCD), the mean squared error (MSE) between the forgettability curves of each disease. FCD quantifies how similar two classes are with respect to their forgetting behavior over all sparsity ratios. Ordinary least squares (OLS) linear regression is employed to understand the interaction between difference in class frequency and class co-occurrence with respect to FCD for a given disease pair.
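The two quantities used in this analysis, the forgettability curve (median relative change in AP across runs at each sparsity ratio) and the FCD (MSE between two such curves), can be sketched as follows. The nested-list layout of `ap_runs` and the toy AP values are assumptions made for illustration; they are not results from the study.

```python
from statistics import median


def forgettability_curve(ap_runs):
    """Median relative change in AP at each sparsity ratio for one class.

    ap_runs[i][j] is the AP of run i at the j-th sparsity ratio, where
    j = 0 is the uncompressed model (hypothetical data layout).
    """
    n_ratios = len(ap_runs[0])
    return [
        median((run[j] - run[0]) / run[0] for run in ap_runs)
        for j in range(n_ratios)
    ]


def fcd(curve_a, curve_b):
    """Forgettability Curve Dissimilarity: MSE between two curves."""
    if len(curve_a) != len(curve_b):
        raise ValueError("curves must cover the same sparsity ratios")
    return sum((a - b) ** 2 for a, b in zip(curve_a, curve_b)) / len(curve_a)


# Toy AP values for 3 runs at 3 sparsity ratios (illustrative only).
common_class = [[0.80, 0.80, 0.60], [0.50, 0.45, 0.40], [0.40, 0.42, 0.30]]
rare_class = [[0.10, 0.08, 0.02], [0.20, 0.18, 0.05], [0.10, 0.10, 0.03]]

curve_common = forgettability_curve(common_class)
curve_rare = forgettability_curve(rare_class)
dissimilarity = fcd(curve_common, curve_rare)
print(curve_common, curve_rare, dissimilarity)
```

Dividing by each run's own uncompressed AP is what makes curves comparable between an easy class and a hard one, since both start at zero relative change.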

Pruning-Identified Exemplars (PIEs)
Definition. After evaluating the overall and class-level impact of pruning on CXR classification, we investigate which individual images are most vulnerable to pruning. Like Hooker et al. [13], we consider PIEs to be images where an uncompressed and pruned model disagree. Letting $C$ be the number of classes, we compute the average prediction $\frac{1}{30}\sum_i \hat{y}_0 \in \mathbb{R}^C$ of the uncompressed models and the average prediction $\frac{1}{30}\sum_i \hat{y}_{0.9} \in \mathbb{R}^C$ of the L1-pruned models at 90% sparsity for all NIH-CXR-LT test set images. Then the Spearman rank correlation $\sigma\left(\frac{1}{30}\sum_i \hat{y}_0,\ \frac{1}{30}\sum_i \hat{y}_{0.9}\right)$ represents the agreement between the uncompressed and heavily pruned models for each image; we define PIEs as images whose correlation falls in the bottom 5th percentile of test images.
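A minimal sketch of this selection rule, assuming the run-averaged predictions are already available as one list of C probabilities per image. The function names, the toy predictions, and the choice to pick the lowest-scoring fraction by count (rather than by an interpolated percentile) are assumptions of this sketch.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of ranks pos+1 .. end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant prediction vector as "no agreement"
    return cov / (var_x * var_y) ** 0.5


def find_pies(dense_preds, pruned_preds, fraction=0.05):
    """Return indices of the images with the lowest dense/pruned agreement."""
    scores = [spearman(d, p) for d, p in zip(dense_preds, pruned_preds)]
    n_pies = max(1, int(round(fraction * len(scores))))
    lowest = sorted(range(len(scores)), key=lambda n: scores[n])[:n_pies]
    return sorted(lowest), scores


# Toy run-averaged predictions for 4 images and C = 4 classes (illustrative).
dense = [[0.9, 0.1, 0.3, 0.2], [0.2, 0.8, 0.1, 0.4],
         [0.7, 0.6, 0.1, 0.2], [0.1, 0.2, 0.3, 0.9]]
sparse = [[0.8, 0.2, 0.4, 0.3], [0.3, 0.7, 0.2, 0.5],
          [0.1, 0.2, 0.6, 0.7], [0.2, 0.3, 0.4, 0.8]]
pies, scores = find_pies(dense, sparse, fraction=0.25)
print(pies, scores)
```

Rank correlation compares the ordering of classes within an image rather than the raw probabilities, so an image is flagged only when pruning changes which findings the model considers most likely.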
Analysis & Human Study. To understand the common characteristics of PIEs, we compare how frequently (i) each class appears and (ii) images with d = 0, ..., 3, 4+ simultaneous diseases appear in PIEs relative to non-PIEs. To further analyze qualities of CXRs that require domain expertise, we conducted a human study to assess radiologist perceptions of PIEs. Six board-certified attending radiologists were each presented with a unique set of 40 CXRs (half PIE, half non-PIE). Each image was presented along with its ground-truth labels and the following three questions:
1. Do you fully agree with the given label? [Yes/No]
2. How would you rate the image quality? [1-5 Likert]
3. How difficult is it to properly diagnose this image? [1-5 Likert]
We use the Kruskal-Wallis test to evaluate differential perception of PIEs.

Fig. 2: "Forgettability curves" depicting relative change in AP (median across 30 runs at each sparsity ratio) upon L1 pruning for a subset of classes.

What is the overall effect of pruning?
We find that under L1 pruning, the first sparsity ratio causing a significant drop in mean AP is 65% for NIH-CXR-LT (P < 0.001) and 60% for MIMIC-CXR-LT (P < 0.001) (Fig. 1, left). This observation may be explained by the fact that ResNet50 is highly overparameterized for this task. Since only a subset of weights are required to adequately model the data, the trained classifiers have naturally sparse activations (Fig. 1, right). For example, over half of all learned weights have magnitude under 0.01. However, beyond a sparsity ratio of 60%, we observe a steep decline in performance with increasing sparsity for both datasets.

Which diseases are most vulnerable to pruning and why?
Class forgettability curves in Fig. 2 depict the relative change in AP by sparsity ratio for a representative subset of classes. Although these curves follow a similar general trend to Fig. 1, some curves (i) drop earlier and (ii) drop more considerably at high sparsity. Notably, we find a strong positive relationship between training class frequency and (i) the first sparsity ratio at which a class experienced a median 20% relative drop in AP (ρ = 0.61, P = 0.005 for NIH-CXR-LT; ρ = 0.93, P ≪ 0.001 for MIMIC-CXR-LT) and (ii) the median relative change in AP at 95% sparsity (ρ = 0.75, P < 0.001 for NIH-CXR-LT; ρ = 0.75, P < 0.001 for MIMIC-CXR-LT). These findings indicate that, in general, rare diseases are forgotten earlier (Fig. 3, left) and are more severely impacted at high sparsity (Fig. 3, right).
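The "forgotten first" statistic can be read directly off a forgettability curve. The sketch below assumes a curve is given as a list of median relative AP changes aligned with a list of sparsity ratios; the function name and the toy curves are hypothetical.

```python
def first_drop_sparsity(sparsities, curve, threshold=-0.20):
    """First sparsity ratio at which the median relative change in AP
    reaches the threshold (a 20% relative drop by default).

    Returns None if the class never drops that far.
    """
    for k, change in zip(sparsities, curve):
        if change <= threshold:
            return k
    return None


# Toy curves (illustrative values only, not taken from the study).
sparsities = [0.0, 0.5, 0.7, 0.9, 0.95]
common_curve = [0.0, -0.01, -0.05, -0.22, -0.50]
rare_curve = [0.0, -0.05, -0.30, -0.70, -0.90]
robust_curve = [0.0, 0.0, -0.01, -0.05, -0.10]

first_common = first_drop_sparsity(sparsities, common_curve)
first_rare = first_drop_sparsity(sparsities, rare_curve)
first_robust = first_drop_sparsity(sparsities, robust_curve)
print(first_common, first_rare, first_robust)
```

Computing this value for every class and rank-correlating it with training frequency gives the first of the two relationships reported here; the second uses the final entry of each curve instead.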

How does disease co-occurrence influence class forgettability?
Our analysis reveals that for NIH-CXR-LT, the absolute difference in log test frequency between two diseases is a strong predictor of the pair's FCD (ρ = 0.64, P ≪ 0.001). This finding suggests that diseases with larger differences in prevalence exhibit more distinct forgettability behavior upon L1 pruning (Fig. 4, left). To account for the multi-label nature of thorax disease classification, we also explore the relationship between FCD and intersection over union (IoU), a measure of co-occurrence between two diseases. Our analysis indicates that the IoU between two diseases is negatively associated with FCD (ρ = −0.47, P ≪ 0.001). This suggests that the more two diseases co-occur, the more similar their forgetting trajectories are across all sparsity ratios (Fig. 4, right). For example, the disease pair (Infiltration, Hernia) has a dramatic difference in prevalence (|LogFreqDiff| = 4.58) and rare co-occurrence ($\mathrm{IoU}^{1/4}$ = 0.15), resulting in an extremely high FCD for the pair of diseases.
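For concreteness, the two pairwise predictors can be computed from a multi-label annotation list as sketched below. Representing each image as a set of class names is a hypothetical format, the class names "A" and "B" are placeholders, and the natural logarithm is an assumption, since the base of the log frequency is not stated here.

```python
import math


def pair_statistics(labels, class_a, class_b):
    """Co-occurrence and frequency-gap statistics for one disease pair.

    `labels` is a list with one set of class names per image.
    Returns (IoU, IoU ** 0.25, |LogFreqDiff|).
    """
    n_a = sum(1 for present in labels if class_a in present)
    n_b = sum(1 for present in labels if class_b in present)
    if n_a == 0 or n_b == 0:
        raise ValueError("both classes must appear at least once")
    both = sum(1 for present in labels if class_a in present and class_b in present)
    union = n_a + n_b - both
    iou = both / union
    # Assumption: natural log of the raw class counts.
    log_freq_diff = abs(math.log(n_a) - math.log(n_b))
    return iou, iou ** 0.25, log_freq_diff


# Toy annotations for 8 images: "A" appears 6 times, "B" 3 times, both twice.
labels = [{"A", "B"}, {"A"}, {"A", "B"}, {"A"}, {"B"}, {"A"}, set(), {"A"}]
iou, iou_root, log_freq_diff = pair_statistics(labels, "A", "B")
print(iou, iou_root, log_freq_diff)
```

Taking the fourth root spreads out IoU values that would otherwise cluster near zero, which is typical when one class in the pair is rare.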
We also find, however, that there is a push and pull between differences in individual class frequency and class co-occurrence with respect to FCD. To illustrate, consider the disease pair (Emphysema, Pneumomediastinum) marked in black in Figure 4. These classes have an absolute difference in log frequency of 2.04, which would suggest an FCD of around 0.58. However, because Emphysema and Pneumomediastinum co-occur relatively often ($\mathrm{IoU}^{1/4}$ = 0.37), their forgettability curves are more similar than prevalence alone would dictate, resulting in a lower FCD of 0.18. To quantify this effect, we fit an OLS model of FCD as a function of |LogFreqDiff|, $\mathrm{IoU}^{1/4}$, and their interaction:

$$\mathrm{FCD} = \beta_0 + \beta_1\,|\mathrm{LogFreqDiff}| + \beta_2\,\mathrm{IoU}^{1/4} + \beta_3\,|\mathrm{LogFreqDiff}| \cdot \mathrm{IoU}^{1/4}$$

We observe a statistically significant interaction effect between the difference in individual class frequency and class co-occurrence on FCD ($\beta_3$ = −0.31, P = 0.005). Thus, for disease pairs with a very large difference in prevalence, the effect of co-occurrence on FCD is even more pronounced (Supplement).
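To see why a negative interaction coefficient leads to this conclusion, the short derivation below assumes the regression has the standard OLS interaction form, with $x$ = |LogFreqDiff| and $z$ = $\mathrm{IoU}^{1/4}$. Only $\beta_3$ is reported in this text, so the other coefficients are left symbolic.

```latex
% Assumed model form: main effects plus one interaction term.
\mathrm{FCD} = \beta_0 + \beta_1 x + \beta_2 z + \beta_3 x z

% Marginal effect of co-occurrence on FCD at a fixed frequency gap x:
\frac{\partial\,\mathrm{FCD}}{\partial z} = \beta_2 + \beta_3 x

% With the reported \beta_3 = -0.31, the interaction contributes:
%   x = 2.04 (Emphysema, Pneumomediastinum):  -0.31 \times 2.04 = -0.6324
%   x = 4.58 (Infiltration, Hernia):          -0.31 \times 4.58 = -1.4198
```

Under this form, the slope of FCD with respect to co-occurrence becomes more negative as the frequency gap grows, so the same increase in $\mathrm{IoU}^{1/4}$ reduces FCD more for pairs with very different prevalence.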

What do pruning-identified CXRs have in common?
For NIH-CXR-LT, we find that PIEs are more likely to contain rare diseases and more likely to contain 3+ simultaneous diseases when compared to non-PIEs (Fig. 5). The five rarest classes appear 3-15x more often in PIEs than non-PIEs, and images with 4+ diseases appear 3.2x more often in PIEs.
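The ratios quoted here compare how often a class appears among PIEs with how often it appears among non-PIEs. A minimal sketch, assuming per-image label sets and a parallel list of PIE flags (both hypothetical formats, with toy data in place of the real annotations):

```python
def prevalence_ratio(labels, is_pie, cls):
    """Prevalence of `cls` among PIEs divided by its prevalence among non-PIEs."""
    pie_sets = [present for present, flag in zip(labels, is_pie) if flag]
    non_pie_sets = [present for present, flag in zip(labels, is_pie) if not flag]
    rate_pie = sum(1 for present in pie_sets if cls in present) / len(pie_sets)
    rate_non = sum(1 for present in non_pie_sets if cls in present) / len(non_pie_sets)
    if rate_non == 0:
        raise ValueError("class never appears among non-PIEs; ratio undefined")
    return rate_pie / rate_non


# Toy data: 4 PIEs (2 with Hernia) and 6 non-PIEs (1 with Hernia).
labels = [{"Hernia"}, {"Hernia", "Mass"}, {"Mass"}, set(),
          {"Hernia"}, {"Mass"}, set(), set(), {"Mass"}, set()]
is_pie = [True, True, True, True, False, False, False, False, False, False]
ratio = prevalence_ratio(labels, is_pie, "Hernia")
print(ratio)
```

A ratio of 1 means a class is equally common in both groups; the same function applies to the disease-count analysis if each image's label set is replaced by a single bucket such as "4+".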
In a human reader study involving 240 CXRs from the NIH-CXR-LT test set (120 PIEs and 120 non-PIEs), radiologists perceived that PIEs had more label noise, lower image quality, and higher diagnosis difficulty (Fig. 6). However, due to small sample size and large variability, these differences are not statistically significant. Respondents fully agreed with the label 55% of the time for PIEs and 57.5% of the time for non-PIEs (P = 0.35), gave an average image quality of 3.6 for PIEs and 3.8 for non-PIEs (P = 0.09), and gave an average diagnosis difficulty of 2.5 for PIEs and 2.05 for non-PIEs (P = 0.25).
Overall, these findings suggest that pruning identifies CXRs with many potential sources of difficulty, such as containing underrepresented diseases, (partially) incorrect labels, low image quality, and complex disease presentation.

Discussion & Conclusion
In conclusion, we conducted the first study of the effect of pruning on multi-label, long-tailed medical image classification, focusing on thorax disease diagnosis in CXRs. Our findings are summarized as follows:
1. As observed in standard image classification, CXR classifiers can be heavily pruned (up to 60% sparsity) before dropping in overall performance.
2. Class frequency is a strong predictor of both when and how severely a class is impacted by pruning. Rare classes suffer the most.
3. Large differences in class frequency lead to dissimilar "forgettability" behavior and stronger co-occurrence leads to more similar forgettability behavior.
   - Further, we discover a significant interaction effect between these two factors with respect to how similarly pruning impacts two classes.
4. We adapt PIEs to the multi-label setting, observing that PIEs are far more likely to contain rare diseases and multiple concurrent diseases.
   - A radiologist study further suggests that PIEs have more label noise, lower image quality, and higher diagnosis difficulty.
It should be noted that this study is limited to the analysis of global unstructured L1 (magnitude-based) pruning, a simple heuristic for post-training network pruning. Meanwhile, other state-of-the-art pruning approaches [20,22,6] and model compression techniques beyond pruning (e.g., weight quantization [14] and knowledge distillation [11]) could be employed to strengthen this work. Additionally, since our experiments only consider the ResNet50 architecture, it remains unclear whether other training approaches, architectures, or compression methods could mitigate the adverse effects of pruning on rare classes. In line with recent work [15,17,19], future research may leverage the insights gained from this study to develop an algorithm for improved long-tailed learning on medical image analysis tasks. For example, PIEs could be interpreted as salient, difficult examples that warrant greater weight during training. Conversely, PIEs may just as well be regarded as noisy examples to be ignored, using pruning as a tool for data cleaning.

Fig. 3: Relationship between class "forgettability" and frequency. We characterize which classes are forgotten first (left) and which are most forgotten (right).

Fig. 5: Unique characteristics of PIEs. Presented is the ratio of class prevalence (left) and number of diseases per image (right) in PIEs relative to non-PIEs. The dotted line represents the 1:1 ratio (equally frequent in PIEs vs. non-PIEs).

Fig. 6: Human study results describing radiologist perception of PIEs vs. non-PIEs. Mean ± standard deviation (error bar) radiologist scores are presented.